## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] 1599
## [1] 13
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The RedWineData dataset has 1599 observations of 13 variables: the row index “X” plus “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol” and “quality”. Quality is an ordinal variable scored on a scale from 0 (worst) to 10 (best); in this dataset the observed scores range from 3 to 8.
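The listings above have the shape of output from base R inspection functions. The following is a minimal sketch of such calls, not necessarily the exact ones used in this report. It assumes a small stand-in data frame, `wine_sample` (a hypothetical name), built from three of the columns and the first ten rows printed by `str()` above, so that it runs without the original file.

```r
# Stand-in for RedWineData: first ten rows of three columns, copied from
# the str() listing above. The full data set has 1599 rows and 13 columns.
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_sample)            # column names
str(wine_sample)              # type and first values of each column
nrow(wine_sample)             # number of observations
ncol(wine_sample)             # number of variables
summary(wine_sample)          # per-column quartiles and mean
summary(wine_sample$quality)  # the same summary for quality alone
```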
##
## Pearson's product-moment correlation
##
## data: RedWineData$quality and RedWineData$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: quality and fixed.acidity
## t = -0.62012, df = 215, p-value = 0.5358
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17445680 0.09144487
## sample estimates:
## cor
## -0.04225416
As the charts and histograms above illustrate, some outliers stand out. The distributions of fixed acidity and volatile acidity both have long positive tails, which pulls their means above their medians and makes the median the better measure of central tendency. The citric acid distribution looks slightly bimodal, with a few outliers as well. One interesting thing I noticed is the unusual spikes around 0.0 g/dm^3 and 0.5 g/dm^3, which may indicate that a few concentrations are more common than others. Residual sugar is highly positively skewed. In addition, its plot contains two peaks, a pattern that is visible in several of the plots and could be mainly due to wine type. The density and sulphates distributions, like the others, have long tails. The alcohol distribution is slightly skewed, and its mean (10.42) and median (10.20) are almost the same. The minimum alcohol content of the wines in our dataset is 8.4%.
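The effect of a long positive tail on the mean can be checked numerically. This is a small sketch that assumes only the first ten `residual.sugar` values printed by `str()` above; the variable names are illustrative, and the full column would be used in practice.

```r
# First ten residual.sugar values from the str() listing (illustrative only).
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

mean_sugar   <- mean(residual_sugar)    # pulled upward by the single 6.1 value
median_sugar <- median(residual_sugar)  # depends only on the middle values

c(mean = mean_sugar, median = median_sugar)
```

Even in this tiny sample the one large value lifts the mean above the median, which is the same pattern the full-data summary shows for residual sugar (mean 2.539, median 2.200).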
We could even remove outliers if we found it appropriate, which would make the following analysis more robust (see http://www.public.iastate.edu/~maitra/stat501/lectures/Outliers.pdf). However, at this stage I decided to keep the outliers so that the reader gets a better understanding of the dataset as it is.
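If outliers were to be removed, one common convention is the boxplot rule: flag values that lie more than 1.5 times the box length beyond the hinges. The sketch below uses base R's `boxplot.stats()` for this. It is an illustration under that assumption, not a step performed in this analysis, and it uses only the first ten `total.sulfur.dioxide` values from the `str()` listing above.

```r
# First ten total.sulfur.dioxide values from the str() listing (illustrative).
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# boxplot.stats() returns the points beyond 1.5 * box length in $out.
outlier_values <- boxplot.stats(total_so2)$out

# Keep only the values that were not flagged.
trimmed_so2 <- total_so2[!total_so2 %in% outlier_values]

outlier_values
length(trimmed_so2)
```

Whether dropping such points is appropriate depends on whether they are measurement errors or genuine wines, which is one reason for keeping them here.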
I myself am a red wine lover. The first variable in this dataset that caught my interest was the quality of the wine. Quality is usually scored on a scale up to 10; in this dataset, however, the observed scores range only from 3 to 8. For the histogram below, I categorized the wines by quality as good (>= 7), average (5 <= quality < 7) and poor (< 5).
## poor average good
## 63 1319 217
Looking at the first round of histograms that I created for the variables, residual sugar and chlorides appear to have extreme outliers. Citric acid has a large number of zero values; I wonder whether this is a case of non-reporting of values. Density and pH look approximately normally distributed, with only a few outliers. The sulfur dioxides, fixed and volatile acidity, sulphates, and alcohol seem to be long-tailed.
Log10 scale: in base R's plot(), the log argument is a character string that contains “x” if the x axis is to be logarithmic. Here, with ggplot2, I used scale_x_log10() to put the x axis on a log10 scale.
ggplot(data = RedWineData,
aes(x = citric.acid)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = RedWineData,
aes(x = fixed.acidity)) +
geom_histogram() +
scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = RedWineData,
aes(x = volatile.acidity)) +
geom_histogram() +
scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that, on the log10 scale, fixed.acidity and volatile.acidity appear to be approximately normally distributed.
I created a new variable, qualityrating, to rate the quality of each wine as good, average or poor.
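One way to build such a variable is with `cut()`. This is a minimal sketch, assuming the thresholds stated earlier (poor below 5, average 5 to 6, good 7 and above) and using only the first ten quality scores from the `str()` listing; the original code may have been written differently.

```r
# First ten quality scores from the str() listing (illustrative only).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Intervals are right-closed: (-Inf, 4] poor, (4, 6] average, (6, Inf) good.
# For integer scores this matches < 5, 5 to 6, and >= 7.
qualityrating <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("poor", "average", "good"),
  ordered_result = TRUE
)

table(qualityrating)
```

Making the result an ordered factor keeps the levels in poor, average, good order in tables and on plot axes.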
I addressed the distributions in the ‘Distributions and outliers’ section. Here I continue by visualizing the outliers with boxplots. It is important to mention that I did not perform any operations to tidy, adjust or reshape the data in this Univariate Analysis section; I will do so in the bivariate section. To produce the boxplots, I first define a new function that creates a boxplot for a given variable.
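A helper of this kind might look like the sketch below. The function name `make_boxplot` and the stand-in data frame `wine_sample` are hypothetical, and the original helper is not shown in this report, so this is only one plausible shape for it. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical helper: a single boxplot of one column, named by a string.
make_boxplot <- function(data, variable) {
  ggplot(data, aes(x = "", y = .data[[variable]])) +
    geom_boxplot() +
    labs(x = NULL, y = variable)
}

# Stand-in data: first ten rows of two columns from the str() listing.
wine_sample <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# One plot object per variable; print an element to draw it.
plots <- lapply(names(wine_sample), function(v) make_boxplot(wine_sample, v))
```

Passing the column name as a string and looking it up through `.data[[variable]]` lets the same function be reused for every variable with `lapply()`.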
# Bivariate Plots Section
The plots above show, for example, that a good wine tends to have a lower pH, higher alcohol, and higher fixed acidity and citric acid. Volatile acidity is the exception: it tends to be lower in good wines, and its correlation with quality, reported below, is negative.
Some statistics on the quality of the wines. The question to be answered is: which elements have the most influence on the quality of wine?
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## log10.residual.sugar log10.chlordies free.sulfur.dioxide
## 0.02353331 -0.17613996 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## log10.sulphates alcohol
## 0.30864193 0.47616632
As it appears, alcohol, sulphates, volatile acidity and citric acid have the strongest correlations with quality (in absolute value).
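Correlations of this kind can be computed in one call by passing the predictors as a data frame to `cor()`. The sketch below is an assumption about how such a table could be produced, not the report's original code. It uses three predictors and the first ten rows from the `str()` listing, so its numbers will not match the full-data values above.

```r
# Stand-in data: first ten rows of selected columns from the str() listing.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Long-tailed variables are log10-transformed first, as in the table above.
predictors <- data.frame(
  volatile.acidity = wine_sample$volatile.acidity,
  log10.sulphates  = log10(wine_sample$sulphates),
  alcohol          = wine_sample$alcohol
)

# Pearson correlation of each predictor column with quality.
correlations <- cor(predictors, wine_sample$quality)[, 1]

# Rank by strength, ignoring sign.
correlations[order(abs(correlations), decreasing = TRUE)]
```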
Because the plot could get very dense, crowded and hard to read, I used facet_wrap(~qualityrating) to reduce the clutter in my plots. I tried to identify and illustrate the four features most correlated with the quality of the wines. As shown, higher citric acid and lower volatile acidity are associated with a better wine (in terms of quality). Higher sulphates and higher alcohol (%) are also associated with a higher-quality wine.
I rated the wines as good, average and poor, and named the new variable qualityrating. In this scatter plot I left out the average wines and considered only the good and poor ones. I initially wanted to see how alcohol and volatile acidity relate to the quality of wine. As is visible, higher volatile acidity brings down the quality of wine, and higher-quality wines also tend to have higher alcohol. As a result, I can see that a higher percentage of alcohol combined with lower volatile acidity goes with a better wine in terms of quality.
Here I will try to visualise the relationship of each of my variables (from the first plot) with quality in order to support my argument.
In these four boxplots we can see the effect of alcohol on the quality of wine. I personally don’t like the argument that ‘higher alcohol results in higher quality of wine’. In fact, I argue that it is the combination of higher alcohol with other factors that produces higher quality, and in this case the other factor is acidity. The outliers here also show that alcohol by itself is not a strong indicator of a high-quality wine.
## Warning: position_dodge requires non-overlapping x intervals
## Warning: position_dodge requires non-overlapping x intervals
## Warning: position_dodge requires non-overlapping x intervals
## Warning: position_dodge requires non-overlapping x intervals
As I identified in the first plot here, higher acidity, or lower pH, is found in higher-quality wines. There are four color plots that illustrate the relationship between the quality of wine and alcohol combined with pH, citric acid, fixed acidity and volatile acidity.
This exploratory data analysis (EDA) helped me gain insights into the red wine quality dataset. I was able to visualize relationships and correlations between different variables, and to identify the variables most related to the quality of red wine. As it appears, alcohol, sulphates, volatile acidity and citric acid have the strongest correlations with the quality of red wines. I am interested in exploring the quality patterns of white wines to see if they share some similarities.
Extra information on Variable description
Description of attributes:
Fixed acidity: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
Volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
Citric acid: found in small quantities, citric acid can add “freshness” and flavor to wines
Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
Chlorides: the amount of salt in the wine
Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of the wine
Total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
Alcohol: the percent alcohol content of the wine
Quality: output variable (based on sensory data, score between 0 and 10)
I would also like to expand a little on the struggles and successes of the analysis. Working with RStudio made this the easiest project of the Nanodegree; I had a very smooth time completing it, especially compared to OpenStreetMap. Having said that, there were a few arguments that I needed to read about extensively on the internet, or get help from mentors, to understand. In particular, using a factor for color (color = factor(quality)) in the multivariate analysis helped me a lot to describe my plots and explain my arguments better.
Resources:
https://www.bbr.com/wine-knowledge/faq-quality
https://en.wikipedia.org/wiki/Red_wine
https://www.rstudio.com